PACS 89.75.Hc - Networks and genealogical trees

PACS 89.75.Fb - Structures and organization in complex systems

PACS 05.90.+m - Other topics in statistical physics, thermodynamics, and nonlinear dynamical systems

Abstract. - In this paper we introduce a non-fuzzy measure designed to rank the partitions of a network's nodes into overlapping communities. Such a measure can be useful both for quantifying clusters detected by various methods and for finding the overlapping community structure by optimization. The theoretical problem of separating overlapping modules is discussed, and an example of a possible application is given as well.

Introduction. - Networks, in the sense they are used throughout the present paper, are basically graphs describing real-life complex systems taken from the most different scientific areas, but primarily from biology, economy and sociology. According to recent discoveries, real-life networks tend to have some interesting and rather unexpected common properties, such as scale-free degree distribution, a strong disposition to form clusters (also called communities or modules) or having the so-called "small-world" property.

Communities (groups of densely interconnected nodes) within these graphs often correspond to the functional units of the corresponding complex systems, thus their exploration has been a fundamental issue in the study of networks. However, as an important result, these clusters turned out not to be separate, but rather overlapping, sharing many edges and nodes.

Because of the fundamental role clusters play in real-life networks, many algorithms have been proposed with the aim of uncovering the community structure of a variety of networks. Earlier ones primarily detect disjoint clusters [6][7], while some of the recent ones detect overlapping modules as well.



At the same time, along with the development of the algorithms, the demand arose to define and measure the "suitability" of the different partitions provided by the various methods. Moreover, the fact that the concept of "cluster" is not specified enough (in the sense that it does not have a widely accepted definition) makes this problem even more ambiguous. Although some of the proposed measures have become widely accepted and used (for example the so-called "Q-modularity" proposed by Newman and Girvan in [6]), they are defined only for non-overlapping community structures.

Here we would like to note that fuzzy measures have been introduced with the same ambition (namely to measure the "quality" of an overlapping community structure, $\{c_1, \ldots, c_K\}$), but they share a common constraint: every node $i$ has a "belonging factor" $0 \le \alpha_{i,c_r} \le 1$ which expresses how strongly node $i$ belongs to the $r$th cluster $c_r$. The requirement is that

$$\sum_{r=1}^{K} \alpha_{i,c_r} = 1 \tag{1}$$

for all $i$ belonging to the graph, $K$ denoting the number of clusters.

In other words, none of the nodes can belong to more than one community "strongly" (and, in particular, not "fully"). In social networks, this means that if a person belongs, say, to her/his family fully (or "strongly"), then she/he can belong to other communities, like a workplace or a sports club, only very "weakly", or not at all. We believe that this condition is often unrealistic in real-life cases, so our goal has been to define a measure without the above requirement.
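
The restriction imposed by eq. (1) can be made concrete with a short illustrative derivation. The notation $\alpha_{i,c_r}$ for the belonging factor and the threshold $1/2$ for "strong" membership are assumptions made here for illustration only. Because the factors are non-negative and sum to one, a node that belongs to one cluster with a factor above one half must belong to every other cluster with a factor below one half:

```latex
\alpha_{i,c_1} > \tfrac{1}{2}
\;\Longrightarrow\;
\sum_{r=2}^{K} \alpha_{i,c_r} = 1 - \alpha_{i,c_1} < \tfrac{1}{2}
\;\Longrightarrow\;
\alpha_{i,c_r} < \tfrac{1}{2} \quad \text{for all } r \ge 2 .
```

In the limiting case $\alpha_{i,c_1} = 1$ ("full" membership), all other factors are forced to zero.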

Fig. 1: a) A measure based only on the density of modules will be optimal if every edge constitutes a separate cluster. b) An overlapping node that belongs to both the $c_1$ and $c_2$ communities. It contributes with positive values to both clusters. c) The appearance of many similar or almost identical overlapping communities.



We have obtained good results by utilizing the following expectations: (1) the edges of a given node should primarily go into its cluster(s) rather than out of them, and (2) clusters should be dense. The first criterion shows how "justifiable" it is to assign the node $i$ ($\in c_r$) to the $r$th cluster $c_r$: it is the difference between the inward going edges ($\sum_{j \in c_r, i \ne j} a_{ij}$) and the outward going edges ($\sum_{j \notin c_r} a_{ij}$), divided by the degree $d_i$ of node $i$. Putting it together, every node $i$ contributes to the $r$th cluster to which it belongs with the following value:

$$\frac{\sum_{j \in c_r, i \ne j} a_{ij} - \sum_{j \notin c_r} a_{ij}}{d_i} \tag{2}$$

where $a_{ij}$ denotes the corresponding element of the adjacency matrix defining the network, interpreted as usual (see eq. (3)).



In brief, the purpose of the present paper is to define a simple but well-usable non-fuzzy measure which, on the one hand, quantifies cluster structures found by various methods on connected networks, and on the other hand, can be used to detect (overlapping) communities as well by directly optimizing it. To be well-usable, the measure is expected to take values between -1 and 1, where a higher positive value corresponds to a better clustering. The zero value expresses random-like network clustering, and negative values indicate disadvantageous ones.

The proposed measure. - As mentioned above, the notion of "cluster" is not well defined: there are many approaches based on different "intuitive" characteristics of a community, such as its density, the average path length among its nodes, the number of edges going in and out of a given module, the betweenness among nodes belonging to different communities, etc.

Although theoretically measures could be constructed based on any of the above characteristics, in practice the most commonly used ones exploit the expectation that a cluster should be "dense" or, as it is often formulated, modules are expected to have relatively more connections within themselves than among each other [6][8][9]. Using this expectation (clusters should be dense) while allowing overlapping community structure leads to the result that separate edges will be returned as the optimal community structure, since these are the densest subgraphs, see fig. 1a. (This happens for example if one tries to apply Newman's Q-modularity directly to structures where overlapping is enabled.)
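
The degenerate optimum described above can be illustrated with a minimal sketch. The toy path graph, the adjacency-dict representation and the helper name `density` are assumptions made for this illustration and are not taken from the paper. The sketch shows that every single edge, taken as its own "cluster", already reaches the maximal density of 1, whereas the whole graph does not.

```python
from itertools import combinations

def density(nodes, adj):
    """Fraction of the possible edges among `nodes` that are present in the graph."""
    nodes = list(nodes)
    n = len(nodes)
    if n < 2:
        raise ValueError("density needs at least two nodes")
    edges = sum(1 for u, v in combinations(nodes, 2) if v in adj[u])
    return edges / (n * (n - 1) / 2)

# Hypothetical toy network: the path 0 - 1 - 2 - 3 (node -> set of neighbours).
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}

whole = density({0, 1, 2, 3}, adj)  # 3 of the 6 possible edges are present
per_edge = [density({u, v}, adj) for u in adj for v in adj[u] if u < v]
```

Any score that rewards density alone therefore prefers the three one-edge clusters over the graph as a whole, which is why a second criterion is needed.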

According to our experiments, none of the "intuitive approaches" is enough to create a suitable measure alone, because each yields "degenerate structures" as optimal ones, similar to the one seen above. On the other hand, combinations of approaches can handle this phenomenon.



$$a_{ij} = \begin{cases} 1 & \text{if } i \text{ and } j \text{ are connected,} \\ 0 & \text{if not} \end{cases} \tag{3}$$



The more edges go inward and the fewer edges go outward from the cluster, the closer the ratio in eq. (2) gets to 1. If more edges go outward than inward, the expression is negative, and if all of them go outward, the result is -1. Since a node can contribute with positive values to more than one cluster due to the overlapping areas (see fig. 1b), the whole network's modularity value is higher if such a node belongs to both modules.
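
The behaviour of the per-node ratio of eq. (2) can be checked with a minimal sketch. The example network (two 4-cliques sharing two nodes) and the helper names `build_graph` and `node_contribution` are hypothetical and chosen only for illustration; the graph is assumed to be undirected, unweighted and free of self-loops.

```python
from itertools import combinations

def build_graph(cliques):
    """Return an adjacency dict (node -> set of neighbours) from a list of cliques."""
    adj = {}
    for clique in cliques:
        for u, v in combinations(sorted(clique), 2):
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return adj

def node_contribution(i, cluster, adj):
    """(inward edges - outward edges) / degree of node i, relative to `cluster`."""
    inward = len(adj[i] & cluster)   # neighbours of i inside the cluster
    outward = len(adj[i]) - inward   # neighbours of i outside the cluster
    return (inward - outward) / len(adj[i])

# Hypothetical example: two 4-cliques overlapping in nodes 2 and 3.
c1, c2 = {0, 1, 2, 3}, {2, 3, 4, 5}
adj = build_graph([c1, c2])

inner = node_contribution(0, c1, adj)               # all edges of node 0 stay inside c1
shared_in_c1 = node_contribution(2, c1, adj)        # 3 inward, 2 outward, degree 5
shared_in_c2 = node_contribution(2, c2, adj)        # same by symmetry
all_outward = node_contribution(4, {0, 1, 4}, adj)  # none of node 4's edges stay inside
```

In this example the shared node 2 contributes a positive value to both clusters, while a node placed in a cluster that contains none of its neighbours reaches the lower bound of -1.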

To avoid community structures having only a few communities with very high values, we add the criterion that all nodes have to belong to at least one module. (A trivial solution for that is to put all the left-out nodes into a separate cluster at the end. We have obtained our results in this way too.) Also, the appearance of many similar or almost identical overlapping communities (as can be seen in fig. 1c) is avoidable by dividing the above expression by the number of clusters $i$ belongs to, denoted by $s_i$. Thus the $r$th cluster will contribute to the final result with:

$$M^{ov}_{c_r} = \frac{1}{n_{c_r}} \sum_{i \in c_r} \frac{\sum_{j \in c_r, i \ne j} a_{ij} - \sum_{j \notin c_r} a_{ij}}{d_i \cdot s_i} \cdot \frac{n^{e}_{c_r}}{\binom{n_{c_r}}{2}} \tag{4}$$

where $n_{c_r}$ is the number of nodes and $n^{e}_{c_r}$ is the number of edges that the $r$th cluster $c_r$ contains, respectively.



The density of a module, which was our "second requirement", is straightforward to interpret as $n^{e}_{c_r} / \binom{n_{c_r}}{2}$. This expression gives 1 if the $r$th module (which is a (sub)graph) contains all its possible edges, and 0 if it does not have any of them. Since the first factor ranges between -1 and 1, and the second factor between 0 and 1, the whole expression varies between -1 and 1.
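
As a worked illustration of the two factors, consider a hypothetical network of two complete 4-node clusters, $c_1 = \{0,1,2,3\}$ and $c_2 = \{2,3,4,5\}$, that share nodes 2 and 3. The example and the symbol $M^{ov}_{c_1}$ for the per-cluster value are assumptions made for illustration. The contribution of $c_1$ then evaluates as follows:

```latex
% Hypothetical example: c_1 = {0,1,2,3} and c_2 = {2,3,4,5}, both complete; nodes 2 and 3 are shared.
\begin{align*}
\text{nodes } 0,1:&\quad d_i = 3,\ s_i = 1: \quad \frac{3 - 0}{3 \cdot 1} = 1,\\
\text{nodes } 2,3:&\quad d_i = 5,\ s_i = 2: \quad \frac{3 - 2}{5 \cdot 2} = \frac{1}{10},\\
\text{density of } c_1:&\quad \frac{n^{e}_{c_1}}{\binom{n_{c_1}}{2}} = \frac{6}{6} = 1,\\
M^{ov}_{c_1} &= \frac{1}{4}\left(1 + 1 + \frac{1}{10} + \frac{1}{10}\right)\cdot 1 = \frac{11}{20} = 0.55 .
\end{align*}
```

By symmetry $c_2$ receives the same value, so both factors stay within their stated ranges.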

Fig. 2: The question of when to handle a (sub)graph as one community and when as several is non-trivial, because "intuition" gives different answers to different people. At the same time, most of us would agree on separating two 5-cliques overlapping in one single node (a), but handling them as one community if they share 4 nodes (c). Cases in between (b) are a matter of "taste".

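This remains true for the final measure $M^{ov}$ as well, which is the average of the module values, $M^{ov} = \frac{1}{K} \sum_{r=1}^{K} M^{ov}_{c_r}$, that is,

$$M^{ov} = \frac{1}{K} \sum_{r=1}^{K} \left[ \frac{1}{n_{c_r}} \sum_{i \in c_r} \frac{\sum_{j \in c_r, i \ne j} a_{ij} - \sum_{j \notin c_r} a_{ij}}{d_i \cdot s_i} \cdot \frac{n^{e}_{c_r}}{\binom{n_{c_r}}{2}} \right] \tag{5}$$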



Since the density of clusters containing one single node (when $n_{c_r} = 1$) is not defined (because the denominator $\binom{n_{c_r}}{2}$ vanishes), we simply set their $M^{ov}_{c_r}$ modularity value to zero. (Isolated nodes (when $d_i = 0$) cannot appear, since the network is assumed to be connected.)
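
The full measure, including the convention for single-node clusters, can be sketched in a few lines of Python. This is a minimal sketch rather than the authors' implementation: the function name `overlapping_modularity`, the adjacency-dict input format and the example network are assumptions, and the graph is taken to be undirected, unweighted, connected and free of self-loops.

```python
from itertools import combinations

def overlapping_modularity(adj, clusters):
    """Average over clusters of (mean weighted node contribution) * (cluster density)."""
    # s_i: number of clusters each node belongs to
    membership = {v: sum(v in c for c in clusters) for v in adj}
    if any(s == 0 for s in membership.values()):
        raise ValueError("every node must belong to at least one cluster")
    total = 0.0
    for c in clusters:
        n = len(c)
        if n < 2:
            continue  # single-node clusters are assigned the value zero
        n_edges = sum(1 for u, v in combinations(c, 2) if v in adj[u])
        density = n_edges / (n * (n - 1) / 2)
        node_sum = 0.0
        for i in c:
            inward = len(adj[i] & c)
            outward = len(adj[i]) - inward
            node_sum += (inward - outward) / (len(adj[i]) * membership[i])
        total += node_sum / n * density
    return total / len(clusters)

# Hypothetical example: two 4-cliques overlapping in nodes 2 and 3.
c1, c2 = {0, 1, 2, 3}, {2, 3, 4, 5}
adj = {v: set() for v in c1 | c2}
for clique in (c1, c2):
    for u, v in combinations(clique, 2):
        adj[u].add(v)
        adj[v].add(u)

separate = overlapping_modularity(adj, [c1, c2])  # two overlapping clusters
merged = overlapping_modularity(adj, [c1 | c2])   # one cluster holding all nodes
```

For this particular example the merged clustering scores higher than the separated one, so the measure would treat two 4-cliques sharing half of their nodes as a single community.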

Here we would like to note that handling the unclustered 
nodes (nodes that do not belong to any of the modules) 
is possible in many ways. We have chosen to put them 
into a separate community, but some kind of weighting is 
also conceivable, when the weight is in inverse proportion 
to the number of the unclustered nodes (the more nodes 
are clustered, the higher the final score is). Furthermore, 
one can consider the weighting of the clusters according 
to their sizes as well. 

One cluster or more clusters? When to separate? - This question is highly non-trivial, because it is, to a great extent, simply a matter of "intuition" or taste, which differs from person to person. For example, most of us would agree on separating two 5-cliques overlapping in one single node, but handling them as one community if they share 4 nodes (see fig. 2). But what is the case if they share two or three nodes?

Fig. 3: Phase plot of overlapping asymmetric cliques. Given a complete graph with $n_2 = 50$ nodes and a smaller one with $n_1$ nodes ($n_1 \in \{1 \ldots 50\}$, also a complete graph; $n_1$ is shown on the horizontal axis). These two graphs overlap in $o$ nodes (vertical axis), where $o \in \{1 \ldots n_1\}$. The $n_1$-$o$ parameter pairs generate two separate regions: the upper one is where the introduced measure, $M^{ov}$, gives a higher score if the two graphs are handled as one module, while the lower one covers those $n_1$-$o$ pairs which give a higher score if the graphs make up separate communities.

Figure 3 describes how the introduced measure, $M^{ov}$, behaves with respect to the above question. Consider a complete graph with $n_2 = 50$ nodes and a smaller one with $n_1$ nodes ($n_1 \in \{1 \ldots 50\}$, also a complete graph). These two graphs overlap in $o$ nodes, where $o \in \{1 \ldots n_1\}$. The horizontal axis shows the size of the smaller graph, $n_1$, while the vertical axis shows the number of overlapping nodes ($o$) between the two graphs. Two regions show up: the lower region covers the $n_1$-$o$ parameter pairs for which $M^{ov}$ gives a higher score if the two graphs are handled as separate communities, while the upper one covers those $n_1$-$o$ pairs which give a higher score if the overlapping graphs are handled as one module. One extreme is when the overlap is 0 (the two graphs do not share any nodes, horizontal axis), which obviously falls in the lower, "separate" region. The other extreme is when they share all the $n_1$ nodes, that is, the smaller graph (the $n_1$-clique) is a true subgraph, a part of the bigger complete graph; this case is represented by the diagonal line starting from the origin.
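
The comparison behind such a phase plot can be sketched by brute force: build the two overlapping cliques, score the "one module" and "two modules" clusterings, and record which is larger. The sketch below is an illustration under stated assumptions, not the authors' code. The helper names `clique_pair`, `score` and `prefers_merge` are hypothetical, the smaller clique is assumed to have at least two nodes, and the overlap is assumed to be at least one node so that the network stays connected.

```python
from itertools import combinations

def clique_pair(n1, n2, o):
    """Two complete graphs with n1 and n2 nodes sharing o nodes (1 <= o <= n1 <= n2)."""
    big = set(range(n2))
    small = set(range(o)) | set(range(n2, n2 + n1 - o))
    adj = {v: set() for v in big | small}
    for clique in (big, small):
        for u, v in combinations(clique, 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj, small, big

def score(adj, clusters):
    """Average cluster value: mean of (in - out) / (d_i * s_i), times cluster density."""
    s = {v: sum(v in c for c in clusters) for v in adj}
    total = 0.0
    for c in clusters:
        n = len(c)
        if n < 2:
            continue  # single-node clusters count as zero
        edges = sum(len(adj[v] & c) for v in c) / 2
        density = edges / (n * (n - 1) / 2)
        # inward - outward = 2 * inward - degree
        first = sum((2 * len(adj[i] & c) - len(adj[i])) / (len(adj[i]) * s[i])
                    for i in c) / n
        total += first * density
    return total / len(clusters)

def prefers_merge(n1, n2, o):
    """True if handling the two cliques as one module scores higher than separating them."""
    adj, small, big = clique_pair(n1, n2, o)
    return score(adj, [small | big]) > score(adj, [small, big])

one_shared = prefers_merge(5, 5, 1)   # two 5-cliques sharing a single node
four_shared = prefers_merge(5, 5, 4)  # two 5-cliques sharing four nodes
```

Scanning `prefers_merge(n1, 50, o)` over `n1` and `o` would give a boolean grid of the kind shown in the figure; the two 5-clique cases above correspond to panels (a) and (c) of fig. 2.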

An application. - CFinder, an algorithm designed to uncover the overlapping community structure of networks [2], has a "tuning parameter" ($k$) which determines the cohesiveness of the revealed modules: the higher the parameter $k$, the smaller and more disintegrated, but at the same time the more cohesive, the detected communities are. This is a result of the method, which exploits the observation that a typical community consists of several complete subgraphs that tend to share many of their nodes. The algorithm uncovers those modules which form so-called "$k$-clique communities", that is, unions of $k$-cliques that can be reached from each other through a series of adjacent $k$-cliques.
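
The notion of a $k$-clique community can be made concrete with a naive sketch. It is not the CFinder implementation: it enumerates all $k$-node subsets, which is only feasible for very small graphs, and it assumes the usual convention of the clique percolation method that two $k$-cliques are adjacent when they share $k-1$ nodes. The function name and the toy graph are hypothetical.

```python
from itertools import combinations

def k_clique_communities_sketch(adj, k):
    """Unions of k-cliques connected through chains of cliques sharing k-1 nodes."""
    nodes = sorted(adj)
    # Brute-force enumeration of all k-cliques (exponential; small graphs only).
    cliques = [frozenset(c) for c in combinations(nodes, k)
               if all(v in adj[u] for u, v in combinations(c, 2))]
    parent = list(range(len(cliques)))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(range(len(cliques)), 2):
        if len(cliques[a] & cliques[b]) == k - 1:  # adjacent k-cliques
            parent[find(a)] = find(b)

    groups = {}
    for idx, clique in enumerate(cliques):
        groups.setdefault(find(idx), set()).update(clique)
    return list(groups.values())

# Hypothetical toy graph: triangles (0,1,2) and (0,1,5) share an edge,
# triangle (2,3,4) touches the first one only at node 2.
adj = {0: {1, 2, 5}, 1: {0, 2, 5}, 2: {0, 1, 3, 4},
       3: {2, 4}, 4: {2, 3}, 5: {0, 1}}

communities_k3 = sorted(sorted(c) for c in k_clique_communities_sketch(adj, 3))
communities_k2 = sorted(sorted(c) for c in k_clique_communities_sketch(adj, 2))
```

With $k = 3$ the toy graph splits into two communities that overlap in node 2, while with $k = 2$ every edge is a clique and the whole connected graph becomes one community.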

Theoretically $k$ can be any positive integer starting from 3, but in practice it is usually smaller than ten. (If $k = 2$, CFinder detects the connected subgraphs, that is, those modules which are unions of 2-cliques (which are edges) and can be reached from each other through a series of adjacent edges.) The proper value of $k$ depends on the network. In the following we determine the most suitable $k$ for some real-life networks using the introduced measure.

Figure 4 depicts the $M^{ov}$ scores as a function of the $k$ parameter for three real-life networks: (1) word association, (2) protein interaction, and (3) cond-mat publication.

The nodes of the first graph, 'word association', are words which are linked if the people in a survey associated them with each other [15]. (Originally it is a weighted, directed graph, where the weight of an edge indicates the frequency with which people associated the end point of the link with its start point, but here we have used a simplified, undirected and unweighted, form of it.) The 'protein interaction' network describes the protein-protein interactions in S. cerevisiae (see details in [16]), and finally, the 'cond-mat publication' network describes co-authorships among mathematicians, obtained from the Los Alamos cond-mat archive [17]. (Originally this is a weighted



graph as well, where the weights are proportional to the 
number of common works, but, here too, we have used 
a simplified, unweighted version of the graph, in which 
the edges have been eliminated under a certain threshold- 
weight. See more details in fl^.) 

As can be seen in fig. 4, in the case of the protein interaction and the cond-mat publication networks, both curves reach their maximum at $k = 7$, which is their optimum value for $k$.

The word association network displays a very interesting behavior: the whole curve is in the negative region. This is most probably due to the fact that this graph contains many words with several meanings, e.g., the word "bright", which, according to the survey, is often associated with words having alternative meanings, like "smart", "light", "dark", "sun", etc. Accordingly, in a graph like this, if slightly overlapping modules arise around the different meanings of a word, and if there are relatively many edges (associations) between the nodes of these otherwise separate modules, a negative numerator in $M^{ov}$ results.

Fig. 4: The $M^{ov}$ scores as a function of the $k$ "tuning parameter" of the CFinder algorithm, for three real-life networks: (1) cond-mat publication (topmost curve), (2) protein interaction and (3) word association (bottommost curve). The suggested $k$ values are those where the curves reach their maximum.



REFERENCES 

[1] Albert R. and Barabasi A.-L., Statistical mechanics of complex networks Reviews of Modern Physics, Vol. 74 2002, p. 47-97.

[2] Palla G., Derenyi I., Farkas I. and Vicsek T., Uncovering the overlapping community structure of complex networks in nature and society Nature, Vol. 435 2005, p. 814-818.

[3] Watts D.J. and Strogatz S.H., Collective dynamics of 'small-world' networks Nature, Vol. 393 1998, p. 440-442.

[4] Lancichinetti A., Fortunato S. and Kertesz J., Detecting the overlapping and hierarchical community structure in complex systems New J. Phys., Vol. 11 2009, p. 033015.

[5] Adamcsek B., Palla G., Farkas I.J., Derenyi I. and Vicsek T., CFinder: Locating cliques and overlapping modules in biological networks Bioinformatics, Vol. 22 2006, p. 1021-1023.

[6] Newman M.E.J. and Girvan M., Finding and evaluating community structure in networks Phys. Rev. E, Vol. 69 2004, p. 026113.

[7] Newman M.E.J., Modularity and community structure in networks Proc. of the Nat. Academy of Sciences of the USA (PNAS), Vol. 103 2006, p. 8577-8582.

[8] Leicht E.A. and Newman M.E.J., Community structure in directed networks Phys. Rev. Lett., Vol. 100 2008, p. 118703.

[9] Nicosia V., Mangioni G., Carchiolo V. and Malgeri M., Extending the definition of modularity to directed graphs with overlapping communities J. Stat. Mech. 2009, p. P03024.

[10] Nepusz T., Petroczi A., Negyessy L. and Bazso F., Fuzzy communities and the concept of bridgeness in complex networks Physical Review E, Vol. 77 2008, p. 016107.

[11] Scott J., Social Network Analysis: A Handbook (Sage Publications, London) 2000.

[12] Everitt B.S., Cluster Analysis (Edward Arnold, London) 1993.

[13] Newman M.E.J., Detecting community structure in networks Eur. Phys. J. B, Vol. 38 2004, p. 321-330.

[14] Warner S., Eprints and the Open Archives Initiative Library Hi Tech, Vol. 21(2) 2003, p. 151-158.

[15] Nelson D.L., McEvoy C.L. and Schreiber T.A., The University of South Florida word association, rhyme, and word fragment norms, http://www.usf.edu/FreeAssociation/ (1998).

[16] Xenarios I., Rice D.W., Salwinski L., Baron M.K., Marcotte E.M. and Eisenberg D., DIP: the Database of Interacting Proteins Nucleic Acids Res, Vol. 28(1) 2000, p. 289-291.

[17] http://arxiv.org/archive/cond-mat.






